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Abstract. Tensors have found application in a variety of fields, ranging from chemometrics 
to signal processing and beyond. In this paper, we consider the problem of multilinear modeling 
of sparse count data. Our goal is to develop a descriptive tensor factorization model of such data, 
along with appropriate algorithms and theory. To do so, we propose that the random variation is 
best described via a Poisson distribution, which better describes the zeros observed in the data as 
compared to the typical assumption of a Gaussian distribution. Under a Poisson assumption, we 
fit a model to observed data using the negative log-likelihood score. We present a new algorithm 
for Poisson tensor factorization called CANDECOMP-PARAFAC Alternating Poisson Regression 
(CP-APR) that is based on a majorization-minimization approach. It can be shown that CP-APR 
is a generalization of the Lee-Seung multiplicative updates. We show how to prevent the algorithm 
from converging to non-KKT points and prove convergence of CP-APR under mild conditions. We 
also explain how to implement CP-APR for large-scale sparse tensors and present results on several 
data sets, both real and simulated. 

Key words. Nonnegative tensor factorization, nonnegative CANDECOMP-PARAFAC, Poisson 
tensor factorization, Lee-Seung multiplicative updates, majorization-minimization algorithms 

1. Introduction. Tensors have found application in a variety of fields, ranging 
from chemometrics to signal processing and beyond. In this paper, we consider the 
problem of multilinear modeling of sparse count data. For instance, we may consider 
data that encodes the number of papers published by each author at each conference 
per year for a given time frame [13], the number of packets sent from one IP address 
to another using a specific port [47], or to/from and term counts on emails [2]. Our 
goal is to develop a descriptive model of such data, along with appropriate algorithms 
and theory. 

Let % represent an A^-way data tensor of size /i x /2 x • • • x /jy. We are interested 
in an i?-component nonnegative CANDECOMP/PARAFAC [8, 21] factor model 

R 

M = ^A, a(i)o...oaW, (1.1) 

r=l 

where o represents outer product and a^"'' represents the rth column of the nonneg- 
ative factor matrix A*-"-' of size /„ x R. We refer to each summand as a component. 
Assuming each factor matrix has been column-normalized to sum to one, we refer to 
the nonnegative A^'s as weights. 

In many applications such as chemometrics [46], we fit the model to the data 
using a least squares criteria, implicitly assuming that the random variation in the 
tensor data follows a Gaussian distribution. In the case of sparse count data, however, 
the random variation is better described via a Poisson distribution [37, 44], i.e., 

Xi ~ Poisson(mi) 
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rather than Xi ~ N{mi,(Tf), where the subscript i is shorthand for the multi- index 
(zi, 12, . . . , zat). In fact, a Poisson model is a much better explanation for the zero 
observations that we encounter in sparse data — these zeros just correspond to events 
that were very unlikely to be observed. Thus, we propose that rather than using the 
least squares (LS) error function given by J2i l^i ~ 'riip, for count data we should 
instead minimize the (generalized) Kullback-Leibler (KL) divergence 



which equals the negative log-likelihood of the observations up to an additive constant. 
Unfortunately, minimizing KL divergence is more difhcult than LS error. 

1.1. Contributions. Although other authors have considered fitting tensor data 
using KL divergence [50, 9, 51], we offer the following contributions: 

• We develop alternating Poisson regression for nonncgative CP model (CP- 
APR). The subproblems are solved using a majorization- minimization (MM) ap- 
proach. If the algorithm is restricted to a single inner iteration per subproblem, it 
reduces to the standard Lee-Seung multiplicative for KL updates [29, 30] as extended 
to tensors by Welling and Weber [50]. However, using multiple inner iterations is 
shown to accelerate the method, similar to what has been observed for LS [19] 

• It is known that the Lee-Seung multiplicative updates may converge to a non- 
stationary point [20] , and Lin [32] has previously introduced a fix for the LS version of 
the Lee-Seung method. We introduce a different technique for avoiding inadmissible 
zeros (i.e., zeros that violate stationarity conditions) that is only a trivial change to the 
basic algorithm and prevents convergence to non-stationary points. This technique is 
straightforward to adapt to the matrix and/or LS cases as well. 

• Assuming the subproblems can be solved exactly, we prove convergence of the 
CP-APR algorithm. In particular, we can show convergence even for sparse input 
data and solutions on the boundary of the nonnegative orthant. 

• We explain how to efficiently implement CP-APR for large-scale sparse data. 
Although it is well-known how to do large-scale sparse tensor calculations for the 
LS fitting function [3], the Poisson likelihood fitting algorithm requires new sparse 
tensor kernels. To the best of our knowledge, ours is the first implementation of any 
KL-divergence-based method for large-scale sparse tensors. 

• We present experimental results showing the effectiveness of the method on 
both real and simulated data. In fact, the Poisson assumption leads quite naturally 
to a generative model for sparse data. 

1.2. Related Work. Much of the past work in nonnegative matrix and tensor 
analysis has focused on the LS error [42, 41, 6, 20, 25, 23, 17], which corresponds to an 
assumption of normal independently identically distributed (i.i.d.) noise. The focus 
of this paper is KL divergence, which corresponds to maximum likelihood estimation 
under an independent Poisson assumption; see §2.2. The seminal work in this domain 
are the papers of Lee and Seung [29, 30], which propose very simple multiplicative 
update formulas for both LS and KL divergence, resulting in a very low cost-per- 
iteration. Welling and Weber [50] were the first to generalize the Lee and Seung 
algorithms to nonnegative tensor factorization (NTF). Applications of NTF based 
on KL-divergence include EEG analysis [38] and sound source separation [16]. We 
note that generalizations of KL divergence have also been proposed in the literature, 
including Bregman divergence [11, 10, 31] and beta divergence [9, 14]. 




(1.2) 
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In terms of convergence, Lin [32] and Gillis and Glienur [18] have shown con- 
vergence of two different modified versions of the Lee-Seung method for LS. Finesso 
and Spreij [15] (tensor extension in [51]) have shown convergence of the Lee-Seung 
method for KL divergence; however, we show later that numerical issues arise if the 
iterates come near to the boundary. This is related to the problems demonstrated by 
Gonzalez and Zhang [20] that show, in the case of LS loss, the Lee and Seung method 
can converge to non-KKT points; we show a similar example for KL divergence in 
§6.2. 

Our convergence theory is not focused on the Lee-Seung algorithm but rather on 
a Gauss-Seidel approach. The closest work is that of Lin [33] in which he considers 
the matrix problem in the least squares sense; in the same paper, he dismisses the KL 
divergence problem as ill-defined but we address that issue in this paper by showing 
that the convex hull of the level sets of the KL divergence problem are compact. 

2. Notation and Preliminaries. 

2.1. Notation. Throughout, scalars are denoted by lowercase letters (a), vectors 
by boldface lowercase letters (v), matrices by boldface capital letters (A), and higher- 
order tensors by boldface Euler script letters {%) . We let e denotes the vector of all 
ones and E denotes the matrix of all ones. The jth column of a matrix A is denoted by 
SLj. We use multi-index notation so that a boldface i represents the index (ii, . . . , i^). 
We use subscripts to denote iteration index for infinite sequences, and the difference 
between its use for an entry and its use as an iteration index should be clear by 
context. 

The notation ]| • ]| refers to the two-norm for vectors or Frobenious norm for 
matrices, i.e., the sum of the squares of the entries. The notation ]] • [Ji refers to the 
one-norm, i.e., the sum of the absolute values of the entries. 

The outer product is denoted by o. The symbols * and represents elementwise 
multiplication and division, respectively. The symbol denotes Khatri-Rao matrix 
multiplication. The mode-n matricization or unfolding of a tensor % is denoted by 
X(„). See Appendix A for further details on these operations. 

2.2. The Poisson Distribution and KL Divergence. In statistics, count 
data is often best described as following a Poisson distribution. For a general discus- 
sion of the Poisson distribution, see, e.g., [44]. We summarize key facts here. 

A random variable X is said to have a Poisson distribution with parameter /i > 
if it takes integer values x = 0, 1, 2, . . . with probability 

PiX^x)^"-^. (2.1) 

The mean and variance of X are both fi; therefore, the variance increases along with 
the mean, which seems like a reasonable assumption for count data. It is also useful 
to note that the sum of independent Poisson random variables is also Poisson. This 
is important in our case since each Poisson parameter is a multilinear combination of 
the model parameters. We contrast Poisson and Gaussian distributions in Figure 2.1. 
Observe that there is good agreement between the distributions for larger values of 
the mean, /i. For small values of /i, however, the match is not as strong and the 
Gaussian random variable can take on negative values. 

We can determine the optimal Poisson parameters by maximizing the likelihood 
of the observed data. Let x be a vector of observations and let /x be the vector of 
Poisson parameters. (We assume that /i^'s are not independent, else the function 
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Fig. 2.1: Illustration of Gaussian and Poisson distributions for two parameters. For 
both examples, we assume that the variance of the Gaussian is equal to the mean m. 



would entirely decouple in the parameters to be estimated.) Then the negative of the 
log of the likelihood function for (2.1) is the KL divergence 

- Xilog^i, (2.2) 

i 

excepting the addition of the constant term log(a;i!), which is omitted. 

Because we are working with sparse data, there are many instances for which 
we expect Xi — which leads to some ambiguity in (2.2) ii = 0. We assume 
throughout that • \og{^) — for all > 0. This is for notational convenience; else, 
we would write (2.2) as M* - Htx.^o log Mi- 

3. CP- APR: Alternating Poisson Regression. In this section we introduce 
the CP- APR algorithm for fitting a nonnegative Poisson tensor decomposition (PTF) 
to count data. The algorithm employs an alternating optimization scheme that se- 
quentially optimizes one factor matrix while holding the others fixed; this is non- 
linear Gauss-Seidel applied to the PTF problem. The subproblems are solved via a 
majorization-minimization (MM) algorithm, as described in §4. 

3.1. The Optimization Problem. Our optimization problem is defined as 
min /(M) = ^TOi-Xilogmi s.t. M = |A; A^^^ . . . , A^^'l G f7, (3.1) 

i 

where = ri\ x ili x • • • x fi,, with 

(3 2) 

r^A [0, +oo)^ and r2„ = { A e [0, l]-^"""-"- I lla^lli = 1 for r = 1,.. } . 

Here M = |A; A^^\ . . . , A^^^] is shorthand notation for (1.1) [3]. Depending on 
context, JA represents the tensor itself or its constituent parts. For example, when 
we say M e il, it means that that the factor matrices have stochasticity constraints 
on the columns. 

The function / is not finite on all of Vl. For example, if there exists i such that 
TOi = and Xi > 0, then /(M) = +oo. If rrij > for all i such that a;i > 0, however, 
then we are guaranteed that /(M) is finite. Consequently, we will generally wish to 
restrict ourselves to a domain for which /(M) is finite. We define 



f](C)=conv({Mef}|/(M)<C}), 



(3.3) 
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where conv(-) denotes the convex hull. We observe that ri(C) C (strict subset) 
since, for example, the all-zero model is not in il{C)- The following lemma states that 
ri(C) is compact for any C > 0; the proof is given in Appendix B. 

Lemma 3.1. Let f be as defined in (3.1) and ^1{C) be as defined in (3.3). For any 
C > 0, i^(C) is compact. 

3.2. CP-APR Main Loop: Nonlinear Gauss-Seidel. We solve problem 
(3.1) via an alternating approach, holding all factor matrices constant except one. 
Consider the problem for the nth factor matrix. We note that there is scaling ambi- 
guity that allows us to express the same M in different ways, i.e.. 



M = 



A(^\ . . . , A("^^',B("\ A("+^', . . . , A^^) 



(3.4) 



where B^"' = A("'A and A = diag(A). (3.5) 

The weights in (3.4) are omitted because they are absorbed into the nth mode. From 
[3], we can express M as M(„) = B^^^H^") where B^") is defined in (3.5) and 

n(") = (^a'-^) ... a("+i) a("-i) • • • A^i')"^ . (3.6) 

Thus, we can rewrite the objective function in (3.1) as 

/(M) = [B(")n(") - X(„) * log (B(")n("))] e, 

where e is the vector of all ones, * denotes the elementwise product, and the log 
function is applied elementwise. We note that it is convenient to update A*-"' and A 
simultaneously since the resulting constraint on B^"-* is simply B'"-* > 0. 

Thus, at each inner iteration of the Gauss-Seidel algorithm, we optimize /(DVt) 
restricted to the nth block, i.e., 

B(") = argmin/„(B) = [bH^") - X(„) * log (bU^^A] e. (3.7) 

The updates for A and A*-"-* come directly from B*-"'. Note that some care must 
be taken if an entire column of B^"-* is zero; if the rth column is zero, then we can 
set Xr — and b^"-* to an arbitrary nonnegative vector that sums to one. The full 
procedure is given in Algorithm 1; this is a variant (because of the handling of A) of 
nonlinear Gauss-Seidel. We note that the scaling and unsealing of the factor matrices 
is common in alternating algorithms, though not always explicit in the algorithm 
statement. There are many variations of this basic device; for instance, in the context 
of the LS version of NTF, [17, Algorithm 2] collects the scaling information into an 
explicit scaling vector that is "amended" after each inner iteration 

We defer the proof of convergence until §3.3, but we discuss how to check for 
convergence here. First, we mention an assumption that is important to the theory 
and also arguably practical. Let 

5i"^ = {j I (X(„)),, >0} (3.8) 

denote the set of indices of columns for which the ith row of '^(n) is non-zero. If 
iV = 3, then X(]^)(i, :) corresponds to a vectorization of the ith horizontal slice of DC, 
X(2)(i, to a vectorization of the ith lateral slice, and X(3)(i, :) to a vectorization of 
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Algorithm 1 CP- APR Algorithm (Ideal Version) 

Let X be a tensor of size /i x • • • x Let M = |A; A^"^-*, • • • , A^^^] be an initial 
guess for an i?-component model such that Me f^(C) for some ^ > 0. 
1; repeat 

2: for n = 1, . . . , do 

3: n ^ (a(^) • • • a("+^) a("-i) • • • A^^^y 

4: B argmine^ [BII — ^{n) * log (BII)] e > subproblem 

5: A ^ e^B 

6: A("' ^ BA^ 

7: end for 

8; until convergence 



the ith frontal slice. More generally, we can think of vectorizing "hyperslices" with 
respect to each mode. 

Assumption 3.2. The rows of the suhmatrixllS^\:,s\^'^) (i.e., only the columns 
corresponding to nonzero rows in X(„) are considered) are linearly independent for all 
i = 1, . . . , In and n = \, . . . ,N . 

Assumption 3.2 implies that \S\"^\ > R for all i. Thus, we need to observe at 
least R ■ max„ /„ counts in the data tensor 3C, and the counts need to be sufficiently 
distributed across "X. Consequently, the conditions appeal to our intuition that there 
are concrete limits on how sparse the data tensor can be with respect to how many 
parameters we wish to fit. If, for example, we had X(i)(i, :) = 0, it is clear that we 
can remove element i from the first dimension entirely since it contributes nothing. 
We are making a stronger requirement: each element in each dimension must have at 
least R nonzeros in its corresponding hyperslice. 

A potential problem is that Assumption 3.2 depends on the current iterate, which 
we cannot predict in advance. However, we observe that if A > and the factor 
matrices have random uniform [0,1] positive entries and R < min„ J^^_^j^ /„, then 
this condition is satisfied with probability one^ . This condition can be checked as the 
iterates progress. 

The matrix 

n(")T, (3.9) 

with denoting elementwise division, will come up repeatedly in the remainder of 
the paper. For instance, we observe that the partial derivative of / with respect to 
A^"^ is 9//0A^"^ = (E - 4>^"^)A, where E is the matrix of aU ones. Consequently, 
the matrix ^'■"^ plays a role in checking convergence as follows. 

Theorem 3.3. IfX>OandM= |A; A^^^, . . . , A^^^] e n{C) for some C > 0, 
then "M. is a Karush-Kuhn- Tucker (KKT) point of (3.1) if and only if 

min(A("\E = /or n = 1, . . . , iV. (3.10) 



X(„)0 (B(")n(") 



^We can actually appeal to a weaker assumption; if the entries are drawn from any distribution 
that is absolutely continuous with respect to the Lebesgue measure on [0,1] then the condition is 
satisfied with probability one. 
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Proof. Since A > 0, we can assume that A has been absorbed into A*-™-* for some 
TO. Thus, we can replace the constraints A e f^A and A*-™-* e ri„ with B*-™-* > 0. In 
this case, the partial derivatives are 

E - and = (e - A for n ^ to. (3.11) 



Since M G i^(C) for some C > 0, we know that not all elements of M are zero; thus, 
the set of active constraints are linearly independent. The following conditions define 
a KKT point [40]: 

E - - t'"^ = 0, 

(E-*("))A-T(")-e(r7("))T==0, eTA(") = l for n^m 

V / ; I 7- (3 12) 

^(«) > ^(n) > ^(n) ^ j^{n) ^ Q for n 7^ TO 
g(m) > > Y(m) ^ g(m) ^ 

Here Y*-"-' are the Lagrange multipliers for the nonnegativity constraints and t]^"'' are 
the Lagrange multipliers for the stochasticity constraints. 

If M = (A; A(^\ . . . , A^^)) is a KKT point, then from (3.12), we have that T^™) = 
E - #(™) > 0, B(") > 0, and T^") * B^'") = 0. Thus, min(A('"' A, E - $(™)) = 0. 
Since A > and to is arbitrary, (3.10) follows immediately. 

If, on the other hand, (3.10) is satisfied, choosing T*") = E - and T^"' = 

(E - *("))A and r;(") = for n 7^ to satisfies the KKT conditions in (3.12). Hence, 
M must be a KKT point. □ 

Observe that the condition A > makes A moot in the KKT conditions — this 
reflects the scaling ambiguity that is inherent in the model. 

From Theorem 3.3 and because feasibility is always maintained, we can check for 
convergence by verifying | min(A^"\ E — )| < r for n = 1, . . . , A^, where r > is 
some specified convergence tolerance. 

3.3. Convergence Theory for CP-APR. We require the strict convexity of 
/ in each of the block coordinates. This is ensured under Assumption 3.2. 

Lemma 3.4 (Strict convexity of subproblem). Let fn{-) he the function f re- 
stricted to ttie nth block as defined in (3.7). If Assumption 3.2 is satisfied, then /„(B) 
is strictly convex over B„ = {B € [0, +00)'"^^ : BII^"^ 7^ 0}. 

Proof. In the proof, we drop the rt's for convenience. First note that B is convex. 
Let C = B^. We can rewrite (3.7) as min/(C^) = J^ij ^J'^j~^ij log(c7^i) subject to 
C > 0. Hence, it is sufficient to show that the function /(C) — — J^ij ^ij ^'^si^J'^^j) 
is strictly convex over the convex set C = {C € [0, +cx))^^^" : C^II ^ 0}. Fix 
C, C € C such that C 7^ C. Since the inner product is affine and log is a strictly 
concave function, we need only show that there exists some i and j such that Xij 7^ 
and cJtTj 7^ cjwj. We know at least one column must differ since C 7^ C; let i 
correspond to that column and define d = — c, 7^ 0. By Assumption 3.2, we know 
that n(:, Si) has full row rank. Thus, there exists a column j of 11 such that Xij 7^ 
and d^TTj 7^ 0. Hence, the claim. □ 

Here we state our main convergence result. Although this result assumes that the 
subproblems can be solved exactly (which is not the case in practice), it gives some 
idea as to the convergence behavior of the method. We follow the reasoning of the 
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proof of convergence of nonlinear Gauss-Seidel [5, Proposition 3.9], adapted here for 
the way that A is handled. 

Theorem 3.5 (Convergence of CP-APR). Suppose that /(M) is strictly convex 
with respect to each block component and that it is minimized exactly for each block 
component subproblem of CP-APR. Let be a limit point of the sequence {JVik} 
such that A* > 0. Then DVt* is a KKT point of (3.1). 

Proof. Let Mfc = (A^, A^""^-*, . . . , A^^-*) be the kth iterate produced by the outer 
iterations of Algorithm 1. Define Z^"-* to be the nth iterate in the inner loop of outer 
iteration k with the A- vector absorbed into the nth factor, i.e., 

^(«)_/a(i) a("-i) r(") a("+i) a(^)\ 

where B^'^^-^ is the solution to the nth subproblem at iteration k. This defines A^.'^^j^ 

to be the column-normalized version of B^,"^^^, i.e., A^,"^^^ = B^'^^^(diag(B^'^^^e))~^. 
Taking advantage of the scaling ambiguity to shift the weights between factors yields 

f{%t') = f{{Ai%, . . . , Ai;V\ diag(Bi';\e), A^^ . . . , Af))), 
= /((A«, . . . , Ai;V\ A^, Ai"+i) cliag(Bi;\e), . . . , Af))), 
> /((A«„ . . . , Ai;-), Ai",),, Bi-^), . . . , Ai-))) = /(2,r^)). 

Observe that Z^.^-* = (A^^-^, . . . , A^^^ a[,^'^ diag(Afe4.i)), so there is a correspon- 
dence between Zi^^' and Ivlk+i such that /(Z^^'') = /(M^+i). For convenience, 
we define Zsf^ = (A^^'' diag(A^.), A^,^-*, . . . , A^^-*). Since we assume the subproblem is 
solved exactly at each iteration, we have 

/(Mfe) > /(2,«) > /(Zf ) > • • • fizi""-'^) > /(Mfe+i) for all k. (3.13) 

Recall that ri(C) is compact by Lemma 3.1. Since the sequence {M^} is contained 
in the set ri(C), it must have a convergent subsequence. We let {k(} denote the indices 
of that convergent subsequence and M* = (A*, A{^\ . . . , A^^-*) denote its limit point. 
By continuity of /, /(MfeJ /(M*). 

We first show that || a[,"|,^^j^ — a[,"^-' || 0. Assume the contrary, i.e., that it does not 

converge to zero. Let = HX^^"* — Z^^"*!!. By possibly restricting to a subsequence 
of {ki}, we may assume there exists some 70 > such that j{ki) > 70 for all £. Let 
Si'^ = {Zi^ - 2»i°V7..; then 2,^ = +7..si^;, ||sW|| = 1, and differs from 
zero only along the first block component. Notice that {Sg''} belong to a compact 
set and therefore has a limit point S^^K By restricting to a further subsequence of 
{ki}, we assume that S^^^ — > S^^' 

Let us fix some e e [0, 1]. Notice that < 670 < jkf Therefore, z'f^ + £708^^'' 

lies on the line segment joining z'j^^ and 2i[,°^ + 7fejSg^ = and belongs to ^2(C) 
because ri(C) is convex. Using the convexity of / w.r.t. the first block component 
and the fact that Z^^^ minimizes / over all Z, that differ from Z^j}^ in the first block 
component, we obtain 

fi<') = .f<' + 7..SW) < /(Z£) + .7oSW) < ^(2,(0)). 
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Since — fCM-kf) — > /(M*), equation (3.13) shows that /(Z^.^-*) also converges 

to /(M*). Taking hmits as £ tends to infinity, we obtain 

/(M,)</(Zl°)+e7oSW) </(M,), 

where 2:^''-' is just with A* absorbed into the first component. We conclude that 
/(M*) = /(2,1") + e7oSl^)) for every e e [0, 1]. Since 7oSl^) ^ 0, this contradicts the 
strict convexity of / as a function of the first block component. This contradiction 
establishes that ||A^^^^^ — A^^'*!] — > 0. In particular, converges to 

By definition of Zs^j^^ and the assumption that each subproblem is solved exactly, 
we have 

fizi'^) < /((B, A^^,. . . , Ai^))) for aU B > 0. 

Taking limits as — > oo, we obtain 

/(M,) </((B,a12),...,aW)) foraUB>0. 

In other words, B^^' ~ A^^-* diag(A») is the minimizer of / with respect to the first 
block components with the remaining components are fixed at A^^' through A^^-*. 
From the KKT conditions [40], we have that 

B«>0, ^(bW)>0, bW*^(B«)=0. 
In turn, since A* > 0, we have min(A^^\ E — $1^^) = 0. 

(2t) ('2) 

Repeating the previous argument shows that ||A^^^^-^ — — >■ and that 
min(Al2\E - ^f'>) = 0. Continuing inductively, min(Al"\E - = for n = 

1, . . . , iV. Thus, by Theorem 3.3, M* is a KKT point of /(M). □ 

Before proceeding to the discussion solving the subproblem, we point out that 
remarkably very little is assumed about the objective function / in Theorem 3.5. The 
proof required that / is differentiable, strictly convex in each of its block components, 
and there is a ^ > such that the level set is compact. The upshot is that 

Theorem 3.5 applies equally well to other choices of / corresponding to other mem- 
bers in the family of beta distributions that are convex, namely the divergences that 
correspond to /? G [1,2] [14]. In fact, it was also observed in [17] that "rescaling does 
not interfere with the convergence of the Gauss-Seidel iterations" (in the context of 
the LS formulation of NTF). 

4. Solving the CP-APR Subproblem via Majorization-Minimization. 

The basic idea of a majorization-minimization (MM) algorithm is to convert a hard 
optimization problem (e.g., non-convex and/or non-difFerentiable) into a series of sim- 
pler ones (e.g., smooth convex) that are easy to minimize and that majorize the 
original function, as follows. 

Definition 4.1. Let f and g be real-valued functions on M" and M" x W\ 
respectively. We say that g majorizes f at x Cz M" if g{y,Ti) > /(y) for all y G M" 
and 5(x,x) = /(x). 

If /(x) is the function to be optimized and g(-, x) majorizes / at x, the basic MM 
iteration is Xfe-|_i = argmiuy ^(y, Xfc). It is easy to see that such iterates always take 
non-increasing steps with respect to / since /(x^+i) < ^(xfc+i, x^) < g{Tik,Tik) = 
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Algorithm 2 Subproblcm Solver for Algorithm 1 



B ^ A(")A 

repeat > subproblem loop 

* ^ (X(„) (Bn)) 

B ^ B * * 

until convergence 



/(x/j), where is the current iterate and x^+i is the optimum computed at that 
iterate. 

Consider the nth subproblem in (3.7). Here we drop the n's for convenience so 
that (3.7) reduces to 

mm /(B) = e"^ [BH - X * log(Bn)] e. (4.1) 

Recall that X is the nonnegative data tensor reshaped to a matrix of size / x J, 11 is a 
nonnegative matrix of size Rx J with rows that sum to 1, and B is a nonnegative ma- 
trix of size I X R. For clarity in the ensuing discussion, we also restate Assumption 3.2 
in terms of the local variables for this section as follows: 

Assumption 4.2. The rows of the submatrix 11 (:, { j | X^j > }) (i.e., only the 
columns corresponding to nonzero rows in X are considered) are linearly independent 
for all i — 1, . . . , J. 

According to Assumption 4.2, for every i there is at least one j such that Xij > 
0. Thus, we can assume that we have B > such that /(B) is finite. We now 
introduce the majorization used in our subproblem solver. This majorization is also 
a special case of the one derived in [14] when (5 = 1 and has a long history in image 
reconstruction that predates its use in NMF [36, 45, 28]. The objective / is majorized 
at B by the function 



5(B,B) = ^ 



log 



where arij = (42) 



The proof of this fact is straightforward and thus relegated to Appendix C. The 
advantage of this majorization is that the problem is now completely separable in 
terms of bir, i.e., the individual entries of B. Moreover, (7(-,B) has a unique global 
minimum with an analytic expression, given by B * where $ is as defined in (3.9) 
and depends on B. A proof is provided in Appendix C. The MM algorithm iterations 
are then defined by 

Bfe+i =7/^(Bfe) = Bfe**(Bfc), where *(Bfc) = [X (B^H)] H (4.3) 

and X and 11 come from (4.1). If Bq > 0, clearly B^ > for all k. Observe that 
V/(B) = E — $(B). We discuss in §5.3 how to exploit this simple relationship to 
quickly compute stopping rules for the algorithm. The MM algorithm to solve the 
Gauss-Seidel subproblem of Line 4 in Algorithm 1 is given in Algorithm 2. 

The monotonic decrease in objective function does not guarantee that the MM 
iterates will converge to the desired global minimizer of the subproblem. Nonethe- 
less, the following theorem shows that, under mild conditions on the starting point 
Bq (discussed further in §5.2), the MM iterates will converge to the unique global 
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minimum of (4.1). The proof follows the reasoning of the convergence proof of an al- 
gorithm for fitting a regularized Poisson regression problem given in [26] and is given 
in Appendix D. 

Theorem 4.3 (Convergence of MM algorithm). Let f be as defined in (4.1) and 
assume Assumption 4.2 holds, let Bg be a nonnegative matrix such that /(Bq) is finite 
and (Bo)j^ > for all {i,r) such that (^(B*)),^ > 1, and let the sequence {B^.} be 
defined as in (4.3). Then {B^} converges to the global minimizer of f . 

Note that we make a modest but very useful generalization of existing results by 
allowing iterates to be on (or very close to) the boundary. Prior convergence results, 
including [26, 15, 51] assume that all iterates are strictly positive. Though true in 
exact arithmetic, in numerical computations it is not uncommon for some iterates to 
become zero numerically. In §5.2, we show how to ensure the condition on Bp holds 
in practice. 

5. CP-APR Implementation Details. The previous algorithms omit many 
details and numerical checks that arc needed in any practical implementation. Thus, 
Algorithm 3 provides a detailed version that can be directly implemented. A highlight 
of this implementation is the "inadmissible zero" avoidance, which fixes a the problem 
of getting stuck at a zero value with multiplicative updates. 

5.1. Lee-Seung is a special case of CP- APR. If we only take one iteration 
of the subproblem loop (i.e., setting fmax = l)i then CP- APR is the Lee-Seung multi- 
plicative update algorithm for the KL divergence. Thus, we can view the Lee-Seung 
algorithm as a special case of our algorithm where we do not solve the subproblems 
exactly; quite the contrary, we only take one step towards the subproblem solution. 

5.2. Inadmissible Zero Avoidance. A well-known problem with multiplica- 
tive updates is that some elements may get "stuck" at zero; see, e.g., [20]. For 
example, if al"-* = 0, then the multiplicative updates will never change it. In many 
cases, a zero entry may be the correct answer, so we want to allow it. In other cases, 
though, the zero entry may be incorrect in the sense that it does not satisfy the KKT 

(n) (n) 

conditions, i.e., a^^ =0 but 1 — < 0. We refer to these values as inadmissible 
zeros. We correct this problem before we enter into the multiplicative update phase 
of the algorithm. In lines 4-5 of Algorithm 3, any inadmissible zeros (or near-zeros) 
are "scooched" away from zero and into the interior. The amount of the scooch is 
controlled by the user-defined parameter n. The condition in Theorem 4.3 is exactly 
that the starting point should not have any zeros that are ultimately inadmissible. If 
we discover that a sequence of iterates leads to an inadmissible zero (or almost-zero), 
we restart the method by restarting the method with a new starting point. This 
adjustment prevents convergence to non-KKT points. Note that all the quantities 
needed to perform the check are precomputed and that there is no change to the 
algorithm besides adjusting a few zero entries in the current factor matrix. The fix 
for the inadmissible zeros is compatible with the Lee-Seung algorithm for LS error as 
well. 

Lin [32] has made a similar observation in the LS case and applied changes to 
his gradient descent version of the Lee-Seung method. Our correction is different and 
is directly incorporated into the multiplicative update scheme rather than requiring 
a different update formula. Gillis and Glineur [18] proposed a more drastic fix by 
restricting the factor matrices to have entries in [e, oo) for some small positive e. 
Avoiding all zeros clearly rules out the possibility of getting stuck at an inadmissible 



12 



E. C. Chi and T. G. Kolda 



Algorithm 3 Detailed CP- APR Algorithm 



Let X be a tensor of size /i x • • • x /^r. Let M = (A; A^"'^-', • • • , A*-^-*) be an initial 
guess for an i?-coniponent model such that Me ^^(C) for some ^ > 0. 
Choose the following parameters: 

• ^max — Maximum number of outer iterations 

• -^max — Maximum number of inner iterations (per outer iteration) 

• T — Convergence tolerance on KKT conditions (e.g., 10~^) 

• K = Inadmissible zero avoidance adjustment (e.g., 0.01) 

• Ktoi — Tolerance for identifying a potential inadmissible zero (e.g., 10~^°) 

• e = Minimum divisor to prevent divide-by-zero (e.g., 10"^") 

for fc = 1,2, . . . ,fc„iax do 
isConverged -s— true 
for n = 1, . . . , N do 

_ |k, if fc > l,A(")(z,r) < Ktoi, and *("H«,r) > 1, 
1 0, otherwise 
B ^ (A(") + S)A 



S(*, 



n ^ ( aW • • • o a("+i^ a("-i) • • • a(i) ' ^ 



for £ — 1,2,..., i?inax do subproblem loop 

^ (X(„) (max(Bn,e))) 
if I min(B,E -*("))! <r then 

break 
end if 

isConverged ^ false 
B ^ B * 
end for 
e^B 

end for 

if isConverged = true then 

break 
end if 
end for 



zero, but does so at the expense of eliminating any hope of obtaining sparse factor 
matrices, a desirable property in many applications. 

5.3. Practical Considerations on Convergence. The convergence condi- 
tions on the subproblem require that min(B*-"-',E — $'^")) = 0. We do not require 
the value to be exactly zero but instead check that it is smaller in magnitude than 
the user-defined parameter r. We break out of the subproblem loop as soon as this 
condition is satisfied. 

From Theorem 3.3, we can check for overall convergence by verifying (3.10). We 
do not want to calculate this at the end of every n-loop because it is expensive. 
Instead, we know that the iterates will stop changing once we have converged and 
so we can validate the convergence of all factor matrices by checking that no factor 
matrix has been modified and every subproblem has converged. 
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5.4. Sparse Tensor Implementation. Consider a large-scale sparse tensor 
that is too large to be stored as a dense tensor requiring Y[n memory. In this 
case, we can store the tensor as a sparse tensor as described in [3], requiring only 
{N + 1) • nnz(!)C) memory. 

The elementwise division in the update of $ requires that we divide the tensor 
(in matricized form) X by the current model estimate (in matricized form) M = BII. 
Unfortunately, we cannot afford to store M explicitly as a dense tensor because it 
is the same size as %. In fact, we generally cannot even form 11 explicitly because 
it requires almost as much storage as M. We observe, however, that we need only 
calculate the values of M that correspond to nonzeros in X. 

Let P — nnz(3C). Then we can store the sparse tensor X as a set of values and 
multi-indices, {v^^\ i^^^) for p = 1, . . . , P. In order to avoid forming the current model 
estimate, JA., as a dense object, we will store only selected rows of 11, one per nonzero 
in %; we denote these rows by w'-^'^ for p = 1, . . . , P. The pth vector is given by the 
elementwise product of rows of the factor matrices, i.e., 

w(^') ^ A(l)(^(''^ :) * ... * A("-i)(zi^2„ :) * A(«+i)(^|f|„ :).■■■. AW(^i,^^ :). 

In order to determine jC = X0M in the calculation of we proceed as follows. The 
tensor "X will have the same nonzero pattern as DC, and we let ■O^^-' denote its values. 
It can be determined that 

=a;(PV(w<''^A(")(^(f^:)^. 
To calculate $ — Xn, we simply have 

■(p) 

The storage of the w^p-' for p = 1, . . . , P vectors and the entries v'^'p^ requires (P+ 1)P 
additional storage. 

6. Numerical Results for CP-APR. 

6.1. Comparison of Objective Functions for Sparse Count Data. We 

contend that, for sparse count data, KL divergence (1.2) is a better objective function. 
To support our claim, we consider simulated data where we know the correct answer. 
Specifically, we consider a 3-way tensor (A'' = 3) of size 1000 x 800 x 600 and i? = 10 
factors. It will be generated from a model M = |A; A*-^-*, . . . , A^^-*]. The entries 
of the vector A are selected uniformly at random from [0,1]. Each factor matrix 
A^"-* is generated as follows: (1) For each column in A*-"-*, randomly select 10% 
(i.e., 1/P) of the entries uniformly at random from the interval [0,100]. (2) The 
remaining entries are selected uniformly at random from [0,1]. (3) Each column 
is scaled so that its 1-norm is 1 (i.e., its sum is 1). An "observed" tensor can be 
thought of as the outcome of tossing v <^Y\In balls into Y\ In empty urns where 
each entry of the tensor corresponds to an urn. For each ball, we first draw a factor 
r with probability \r/'Y^\r- The indices {i,j,k) are selected randomly proportional 
to ar"' for n = 1,2,3. In other words, the ball is then tossed into the (j,j, A:)th 
urn with probability ct^ii^ a'^j^J a^/^J . In this manner, the balls are allocated across the 
urns independently of each other. This procedure generates entries xi that are each 
distributed as Poisson(TOi). We adjust the final A so that the scale matches that of 
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Least Squares 






KL Diver 


gence 




Lee-Seung LS 


CP-ALS 


Lee-Seung KL 


CP-APR 


Observations 


FMS 


#Cols 


FMS 


#Cols 


FMS 


#Cols 


FMS #Cols 


480000 (0.100%) 


0.58 


6.4 


0.71 


7.3 


0.89 


8.7 


0.96 9.5 


240000 (0.050%) 


0.51 


5.4 


0.72 


7.4 


0.83 


8.2 


0.91 9.2 


48000 (0.010%) 


0.37 


3.8 


0.59 


6.3 


0.76 


7.5 


0.80 7.9 


24000 (0.005%) 


0.33 


3.5 


0.51 


5.7 


0.72 


6.6 


0.74 6.9 



Table 6.1: Accuracy comparison (mean of 10 trials) using the factor match score 
(FMS) and the number of columns correctly identified in the first factor matrix. 



%, i.e., A z/A/||A||. We generate problems where the number of observations ranges 
from 480,000 (0.1%) down to 24,000 (0.005%). RecaU that Assumption 3.2 implies 
that the absolute minimum number of observations is R ■ max„ /„ = 10, 000. We have 
used very few observations, as real problems do indeed tend to be this sparse. 

Table 6.1 shows comparisons of four methods. The first two are optimizing LS: 
Lee-Seung for LS and alternating LS with no nonnegativity constraints (CP-ALS). 
The last two are optimizing KL divergence: Lee-Seung for KL divergence and our 
method (CP- APR). We have also tested the modified Lee-Seung method of Finesso 
and Spreij [15, 51], but it is only a scaled version of the Lee-Seung method for KL 
divergence and gave nearly identical results which are omitted. All implementations 
are from Version 2.5 of Tensor Toolbox for MATLAB [4, 3, 2]; exact parameter settings 
are provided in Appendix E. We report the factor match score (FMS), a measure in 
[0, 1] of how close the computed solution is to the true solution. A value of 1 is ideal. 
Since the FMS measure is somewhat abstract, we also report the number of columns 
in the first factor matrix such that the cosine of the angle between the true solution 
and the computed solution is greater than 0.95. A value of 10 is ideal since we have 
used R = 10. The reported values are averages over 10 problems. See Appendix E for 
precise formulas for both measures. Although these problems are extremely sparse, all 
methods are able to correctly identify components in the data. Overall, the methods 
optimizing KL divergence are superior to those optimizing least squares. We also 
observe that CP-APR is an improvement compared to Lee-Seung KL; we provide 
later evidence that this improvement is more likely due to the inadmissible zero fix 
than the extra inner iterations (which provide a benefit of enhanced speed rather than 
accuracy) . 

6.2. Fixing Misconvergence of Lee-Seung. We demonstrate the effective- 
ness of our simple fix for avoiding inadmissible zeros, as described in §5.2. Our 
technique is based on the same observation on inadmissible zeros as in Lin [32], but 
the change to the algorithm is different. As in [20], we consider fitting a rank-10 
bilinear model for a 25 x 15 dense positive matrix with entries drawn independently 
and uniformly from [0, 1]. We apply CP- APR using i^ax = 1, ^ = 10^^^, e = 0, Ktoi = 
100-einach- We do two runs: one with k = 0, corresponding to the standard Lee-Seung 
(KL version) algorithm, and the other with k = 10^ to move away from inadmissible 
zeros. In both runs we use the same strictly positive initial guess. Figure 6.1 shows 
the magnitude of the KKT residual over more than 10^ iterations. When k > 0, the 
sequence clearly convergences. On the other hand when k = the iterates appear to 
get stuck at a non-KKT point. Closer inspection of the factor matrix iterates reveals 
a single offending inadmissible zero, i.e., its partial derivative is —0.0016 but should 
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Fig. 6.1: Lee-Seung permitting inadmissible zeros (blue solid line) and avoiding inad- 
missible zeros (red dashed line). 



^max 1 5 10 

Median 0.9858 0.9858 0.9862 
Mean 0.9483 0.9514 0.9603 

(a) Factor match scores 
1 5 10 Cax 1 5 10 



Median 9819 7655 7290 Median 168.70 68.98 55.00 
Mean 16370 11710 11660 Mean 299.60 106.10 87.92 



(b) Number of multiplicative updates (c) Time (seconds) 

Table 6.2: CP- APR with different values of -^max for sparse count data over 100 trials. 



be nonnegative. Hence, we use positive values of k in our experiments. 

6.3. The Benefit of Extra Inner Iterations. We show that increasing the 
maximum number of inner iterations £max can accelerate the convergence in Table 6.2. 
Recall that ^max = 1 corresponds to the Lee-Seung algorithm [29, 50]. We consider a 3- 
way tensor {N = 3) of size 500x400x300 and R — h factors. We generate 100 problem 
instances from 100 randomly generated models M = JA; A^^-*, . . . , A'^-*] as described 
in §6.1 with 0.1% observations. We compare CP-APR with fmax = 1, 5, and 10. and 
the other parameters set as fc,„ax = 10^, t = 10^**, n = 10^^, Ktoi 100 • emach, e = 0. 
We track both the number of multiplicative updates (line 8 of Algorithm 3) and the 
CPU time using the MATLAB command cputime. The experiments were performed 
on an iMac computer with a 3.4 GHz Intel Core i7 processor and 8 GB of RAM. 
Table 6.2a reports the FMS scores as we vary ^max, and we observe that the value of 
^max does not significantly impact accuracy. However, we observe that increasing ^max 
can decrease the overall work and runtime. Tables 6.2b and 6.2c present the average 
number of multiplicative updates and total run times respectively. The distribution of 
updates and times was highly skewed as some problems required a substantial number 
of iterations. Nonetheless, we see a monotonic decrease in the number of updates and 
time as ^max increases. The differences are more substantial when comparing wall 
clock time. The reason for the disproportionate decrease in wall-clock time compared 
to the tally of updates is that the cost of the calculation of 11 (in line 6 of Algorithm 3) 
is amortized over all the subproblem iterations. 
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6.4. Enron Data. We consider the application of CP- APR to email data from 
the infamous Federal Energy Regulatory Commission (FERC) investigation of Enron 
Corporation. We use the version of the dataset prepared by Zhou et al. [54] and further 
processed by Perry and Wolfe [43] , which includes detailed profiles on the employees. 
The data is arranged as a three-way tensor % arranged as sender x receiver x month, 
where entry (i, j, k) indicates the number of messages from employee i to employee j 
in month k. The original data set had 38,388 messages (technically, there were only 
21,635 messages but some messages were sent to multiple recipients and so are counted 
multiple times) exchanged between 156 employees over 44 months (November 1998 
- June 2002). We preprocessed the data, removing months that had fewer than 300 
messages and removing any employees that did not send and receive an average of at 
least one message per month. Ultimately, our data set spanned 28 months (December 
1999 - March 2002), involved 105 employees, and a total of 33,079 messages. The data 
is arranged so that the senders are sorted by frequency (greatest to least). The tensor 
representation has a total of 8,540 nonzeros (many of the messages occur between the 
same sender/receiver pair in the same time period). The tensor is 2.7% dense. 

We apply CP-APR to find a model for the data. There is no ideal method for 
choosing the number of components. Typically, this value is selected through trial and 
error, trading off accuracy (as the number of components grows) and model simplicity. 
Here we show results for R — 10 components. We use the same settings for CP- APR 
as specified in Appendix E. 

Figure 6.2 illustrates six components in the resulting factorization; the other four 
are shown in Appendix F. For each component, the top two plots shows the activity 
of senders and receivers, with the employees ordered from left to right by frequency 
of sending emails. Each employee has a symbol indicating their seniority (junior 
or senior), gender (male or female), and department (legal, trading, other). The 
sender and receiver factors have been normalized to sum to one, so the height of the 
marker indicates each employee's relative activity within the component. The third 
component (in the time dimension) is scaled so that it indicates total message volume 
explained by that component. The light gray line shows the total message volume. 
It is interesting to observe how the components break down into specific subgroups. 
For instance, component 1 in Figure 6.2a consists of nearly all "legal" and is majority 
female. This can be contrasted to component 5 in Figure 6. 2d, which is nearly all 
"other" and also majority female. Component 3 in Figure 6.2b is a conversation 
among "senior" staff and mostly male; on the other hand, "junior" staff are more 
prominent in Component 4 in Figure 6.2c. Component 8 in Figure 6.2c seems to be a 
conversation among "senior" staff after the SEC investigation has begun. Component 
10 in Figure 6.2f indicates that a couple of "legal" staff are communicating with many 
"other" staff immediately after the SEC investigation is announced, perhaps advising 
the "other" staff on appropriate responses to investigators. 

6.5. SIAM Data. As another example, we consider five years (1999-2004) of 
SIAM publication metadata that has previously been used by Dunlavy et al. [12]. 
Here, we build a three-way sparse tensor based on title terms (ignoring common stop 
words), authors, and journals. The author names have been normalized to last name 
plus initial(s). The resulting tensor is of size 4,952 (terms) x 6,955 (authors) x 11 
(journals) and has 64,133 nonzeros (0.017% dense). The highest count is 17 for the 
triad ('education', 'Schnabel B', 'SIAM Rev.'), which is a result of Prof. Schnabel's 
writing brief introductions to the education column for SIAM Review. In fact, the 
next 4 highest counts correspond to the terms 'problems', 'review', 'survey', and 
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Fig. 6.2: Components from factorizing the Enron data. 
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Qi LQ, Tseng P, Roos C, Sun 
DF, Kunisch K, Ng KF, 
Jeyakumar V, Qi HD, 
Fukushima M, Kojima M 


SIAM J Optimiz 


8 


model, nonlinear, equations, 
solutions, dynamics, waves, 
difi'usion, system, analysis, 
phase 


Venakides S, Knessl C, Sherratt 
JA, Ermentrout GB, Scherzer 
O, Haider MA, Kaper TJ, Ward 
MJ, Tier C, Warne DP 


SIAM J Appl Math 


9 


equations, flow, model, 
problem, theory, asymptotic, 
models, method, analysis, 
singular 


Klar A, Ammari H, Wegener R, 
Schuss Z, Stevens A, Velazquez 
JJL, Miura RM, Movchan AB, 
Fannjiang A, Ryzhik L 


SIAM J Appl Math 


10 


education, introduction, 
health, analysis, problems, 
matrix, method, methods, 
control, programming 


Flaherty J, Trefethen N, 
Schnabel B, [None], Moon G, 
Shor PW, Babuska IM, Sauter 
SA, Van Dooren P, Adjei S 


SIAM Rev 



Table 6.3: Highest-scoring items in a 10-term factorization of the term x author x 
journal tensor from five years of SIAM publication data. 



'techniques', and to authors 'Flaherty J' and 'Trefethen N'. 

Computing a ten-component factorization yields the results shown in Table 6.3. 
We use the same settings for CP-APR as specified in Appendix E. In the table, 
for the term and author modes, we list any entry whose factor score is greater than 
10~^ • /„, where /„ is the size of the nth mode; in the journal mode, we list any 
entry greater than 0.01. The 10th component corresponds to introductions written 
by section editors for SIAM Review. The 1st component shows that there is overlap 
in both authors and title keyword between the SIAM J. Computing and the SIAM 
J. Discrete Math. The 2nd and 3rd components have some overlap in topic and two 
overlapping authors, but different journals. Both components 8 and 9 correspond to 
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the same journal but reveal two subgroups of authors writing on slightly different 
topics. 

7. Conclusions &: Future Work. We have developed an alternating Poisson 
regression fitting algorithm, CP- APR, for PTE for sparse count data. When such data 
is generated via a Poisson process, we show that methods based on KL divergence 
such as CP- APR recovers the true CP model more reliably than methods based on LS. 
Indeed, in classical statistics, it is well known that the randomness observed in sparse 
count data is better explained and analyzed by the Poisson model (KL divergence) 
than a Gaussian one (LS error). 

Our algorithm can be considered an extension of the Lee-Seung method for KL 
divergence with multiple inner iterations (similar to [19] for LS). Allowing for multiple 
inner iterations has the benefit of accelerating convergence. Moreover, being very 
similar to an existing method, CP-APR is simple to implement with the exception 
of some details of the sparse implementation as described in §5.4. To the best of our 
knowledge, ours is the first implementation of any KL-divergence-based method for 
large-scale sparse tensors. 

In §3.3, we provide a general-purpose convergence proof for the alternating Gauss- 
Seidel approach. The regularity conditions imposed in our proofs make rigorous and 
concrete our intuition that in the context of sparse count data, CP-APR will con- 
verge provided that the data tensor meets a minimal density and that nonzeros are 
sufficiently spread throughout the data tensor with respect to the size of the factor 
matrices being fit. Any subproblem solver can be substituted for the MM method 
without changing the theory. A benefit of the MM subproblem solver is that its multi- 
plier matrix can be used to explicitly track convergence based on the KKT conditions. 
Moreover, we observe that we can use the KKT information to identify and correct 
inadmissible zeros using a "scooch." Lin [32] had a similar observation in the LS 
case but came up with a different correction technique. We analyze convergence of 
the MM subproblem with the "scooch" in order to show that it will always converge. 
Our results are stronger than past results because they allow iterates with some zero 
entries. Even though zero entries are possible to avoid in exact arithmetic, they often 
occur numerical computations and so are important to consider. 

There remains much room for future work. Eoremost among practical considera- 
tions is speed of convergence. Although multiplicative updates are relatively simple to 
compute, CP-APR can require many iterates. One approach to accelerating conver- 
gence would be to replace the MM algorithm subproblem solver. Eor example, Kim 
et al. [24] present fast quasi-Newton methods for minimizing box-constrained convex 
functions that can be used to solve a nonnegative LS or minimum KL-divergence 
subproblem in a nonlinear Gauss-Seidel solver. A second approach is to focus on the 
sequence of outer iterates. Zhou et al. [53] provide a general quasi-Newton accelera- 
tion scheme for iterative methods based on a quadratic approximation of the iteration 
map instead of the loss. 

There has also been significant work in finding sparse factors via ^i-penalization 
for matrices [35] and tensors [39, 49, 17, 34]. Sparse factors often provide more easily 
interpreted models, and penalization may also accelerate the convergence. While the 
factor matrices generated by CP- APR may be naturally sparse without imposing an 
£i-penalty, the degree of sparsity is not currently tunable. One may also consider 
extensions of this work in the context of missing data [22, 7, 48, 1] and for alternative 
tensor factorizations such as Tucker [17]. 

Perhaps most challenging, however, are open questions related to rank and in- 



20 



E. C. Chi and T. G. Kolda 



ference. Questions about how to choose rank are not new; but given the context of 
sparse count data, might that structure be exploited to derive a sensible heuristic or 
even rigorous criterion for choosing the rank? We already see that Assumption 3.2 
imposes an upper bound on the rank to ensure algorithmic convergence. Regard- 
ing inference, our focus in this work was in thoroughly developing the algorithmic 
groundwork for fitting a PTF model for sparse count data. CP- APR can be used to 
estimate latent structure. Once an estimate is in hand, however, it is natural to ask 
how much uncertainty there is in that estimate. For example, is it possible to put 
a confidence interval around the entries in the fitted factor matrices, especially zero 
or near zero entries? Given that inference for the related but simpler case of Poisson 
regression has been worked out, we suspect that a sensible solution is waiting to be 
found. The benefits of answering these questions warrant further investigation. We 
highlight them as important topics for future research. 
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Appendix A. Notation Details. 

Outer product. The outer product of N vectors is an iV-way tensor. For example, 
(ao b o c)ijk = aihjCk- 

Elementwise multiplication and division. Let Jl and B be two same-sized tensors 
(or matrices). Then G = A * "B yields a tensor that is the same size as A. (and B) 
such that Ci = aibi for all i. Likewise, C = 71 B yields a tensor that is the same size 
as A (and B) such that Ci = ai/bi for all i. 

Khatri-Rao product. Give two matrices A and B of sizes Ii x R and I2 x R, then 
C = A © B is a matrix of size /1/2 x R such that 



where the Kronecker product of two vectors of size Ii and I2 is a vector of length /1/2 
given by 



Matricization of a tensor. The mode-n matricization or unfolding of a tensor % 
is denoted by X(„) and is of size /„ x J„ where J„ = Yim^n-^n- In this case, tensor 
element i maps to matrix element (i, j) where 



C = [ai (g) bi a2 ® b2 • • • a^ (g) b^.] , 



flib 
a2b 



a© b = 



1 = 1, 



'Ti 



and 




Appendix B. Proof of Lemma 3.1. In this section, we provide a proof for 
Lemma 3.1. We first establish two useful lemmas. 



Tensors, Sparsity, and Nonnegative Factorizations 



21 



Lemma B.l. Let X be fixed, let M = [A; A^^^ . . . , A^^^l, and let f{M) be the 
objective function as in (3.1). ///(M) < C for some constant C > 0, then there exists 
constants > (depending on DC and (^) such that e^A € [C'lC]- 

Proof. Because the factor matrices are column stochastic, we can observe that 



/(M) = e^A - ^ log ■ ■ ■ a^l) ' 

/ N \ 

> e^A — -fflog (e^A) where i? — j J„ j max Xi. 



(B.l) 



We have C > e^A — dlog (e^A). Let g{a) ~ a — i?log(a) where a > 0. We show 
that g{a) < C, implies there exists > such that a E [CiS,]- First assume there 
is no such lower bound Then there is a sequence q;„ tending to zero such that 
gictn) < C- But for sufficiently large n, we have that — 'i?log(a„) > C- Since a„ > 
for all n, we have that for sufficiently large n the function g{an) > (■ Therefore, there 
is such a lower bound 

Now suppose there is no such upper bound ^, and therefore there is an unbounded 
and increasing sequence Q!„ tending to infinity such that g{oin) < C for ^-H n. Note 
that g' {ol) ^ \ — d / a. Since g{a) is convex, we have that 

5(a) > g{2^) + g' {2d){a - 2t9) = g{2d) + ]^a-d. 

This inequality, however, indicates that for sufficiently large n, the right hand side is 
greater than Therefore, there must be an upper bound ^. Substituting a = e^A 
completes the proof. □ 

Lemma B.2. Let % be fixed, and let fCM.) be the objective function as in (3.1). 
Let il(C) be the convex hull of the level set of f as defined in (3.3). The function /(3Vt) 
is bounded for all 3Vt G f^(C)- 

Proof Let M, M G { M | /(M) < C }■ Define M to be the convex combination 

M = aM + {1 - a)M where a e [0.5,1). 

Note that the restriction on a is arbitrary but makes the proof simpler later on. 
Observe that 

= E I {^~^r + (1 - a)Xr) n + (1 - a)a^Jr) I 

r n ) 

On the one hand, by Lemma B.l, there exists ^ > such that 

TTii < ^ {aXr + (1 - a)Xr^ a ^ + (1 - a) ^ < + (1 - a)^ = 



On the other hand, 



Zi > ^ < aXr ( 



m\ > } \ aAr I I aa- ; > = a ' mj 



Thus, 
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Now consider 

fhi — Xi log rhi < frii + — Xi log a^^^rfii 

= ( fhi — Xi log mi) + C — {N + l)xi log a 
< {rhi - Xi log fhi)+^+{N + l)xi log 2. 

Thus, 

n i \ n / i 

Given these two lemmas, we are finally ready to provide the proof of Lemma 3.1. 

Proof, [of Lemma 3.1] Fix C- If { M e 17 | /(M) < C } is empty, then 12(C) 
is empty and there is nothing left to do. Thus, assume { M G $7 | /(M) < C } is 
nonempty. Since / is continuous at all M G 11 for which /(M) is finite, / is obviously 
continuous on il(C) by Lemma B.2. Since / is continuous, { M £ 17 | /(M) < C } is 
closed because it is the preimage of the closed set (— oo,C] under /; thus, 17(C) is 
closed because it is a convex combination of closed sets. Consequently, we only need 
to show that 17(C) is bounded. Assume the contrary. Then there exists a sequence of 

G 17(C) such that e^Afc — >■ oo. By Lemma B.2, 



models Mfc — 



/(M) is finite on 17(C), but this contradicts Lemma B.l. Hence, the claim. □ 
Appendix C. Deriving the MM updates. 

In this section we derive the MM update rules used to solve the subproblem. 
We first verify that (4.2) majorizes (4.1). For convenience let C = so that (4.1) 
reduces to 

min/(CT) = ^ - X,, log {cj^,) (C.l) 

Proofs of the next two lemmas are given by Lee and Seung in [30] but their 
arguments do not carefully handle boundary points. The following two lemmas and 
their proofs treat with more rigor the existence and value of updates when anchor 
points lie on admissible regions of the boundary. 

Lemma C.l. Let x > be a scalar and tt > 0, tt ^ 0, be a vector of length R. 
For a vector c > 0, c ^ 0, of length R, let the function f be defined by 

/(c) = cJir — x\og (c''^7r) . 

Then f is majorized at c > by 

^ / \ 



g{c, c) = c^TT — X ar log ( j where CLr = 



Proof. If a: = 0, then ^(c, c) = /(c) for all c, and g trivially majorizes / at c. 
Consider the case when a; > 0. It is immediate that (7(0, c) = /(c). The majorization 
follows from the fact that log is strictly concave and that we can write c^tt as a convex 
combination of the elements c^ix^ja,.. Note that if any elements c.^.tx^ are zero, they 
do not contribute to the sum since we assume • log(/i) — for all /i > and oir — 0. 
□ 
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We now derive an expression for the unique global mininiizer of majorization. 
The majorization defined in (4.2) can be expressed in terms of C as 



rij 



log 



where a^ij = '''' — . (C.2) 



Lemma C.2. Let f and g be as defined in (C.l) and (4.2), respectively. Then, 

for all C > such that /(C^) is finite, the function g{-,C) has a unique global 
minimum which is given by (C*)^.^ = otrijXij where arij — CriTTrj/cJnj, for 
all r = 1, . . . , R, i — 1, . . . , I . 

Proof. Because ^(C, C) separates in the elements of C we focus on solving each 
elementwise minimization problem. Dropping subscripts, the minimization problem 
with respect to Cri can be rewritten as 

- j \ J / 

where we have used the fact that — ^- It is sufhcicnt to prove that this 

univariate problem has a unique global minimizer, c* — c^j^j- First, consider the 
case where the second term is nonzero. Inspecting the stationarity condition reveals 
the solution. Moreover, the function is strictly convex and so has a unique global 
minimum. Second, consider the case where the second term is zero. Then, it is 
immediate that the unique global minimum is c* = 0. Moreover, the second term can 
only vanish when ctjXj = 0, and so the formula applies. □ 

Appendix D. Proof of Theorem 4.3. In this section, we prove the MM 
Algorithm in Algorithm 2 solves (4.1). We first establish the following general result 
for algorithm maps. Part (a) is a simple version of Zangwill's convergence theorem 
[52, p. 91] in the case where the objective function and algorithm map are both 
continuous. The proof of part (b) follows arguments of part of a proof for a different 
but related property on MM iterates in [27, p. 198]. 

Theorem D.l. Let f be a continuous function on a domain V, and let ip be a 
continuous iterative map from V into V such that /(■(/'(x)) < /(x) for all x € V with 
ipi'x) 7^ X. Suppose there is an Xq such that the set >C^(xo) = { x G I? | /(x) < /(xq) } 
is compact. Define x^+i — ipi^k) for k = 0,1, . . .. Then (a) the sequence of iterates 
{xfe} has at least one limit point and all its limit points are fixed points of ip, and 
(b) the distance between successive iterates converges to 0, i.e., Hx^+i — Xfe|j — >■ 0. 

Proof. The proof of (a) follows that of Proposition 10.3.2 of [27]. First note that 
the sequence of iterates must be in £/(xo) because /(x^) < /(xq) for all k. Since 
£/(xo) is compact, {xfe} has a convergent subsequence whose limit is in £j^(xo); denote 
this as Xfej — >■ X* as ^ — ?• oo. Since / is assumed to be continuous, lim/(xfej) = f{^*)- 
Moreover, clearly /(x») < /(x^,^) for all k(. 

Note that f{ip{xki)) < /(x^J. Taking the limit of both sides and applying the 
continuity of ip and /, we must have that /(■(/'(x*)) < /(x*). But we also have that 

/(X*) < /(Xfc.^J < /(Xfe, + l) = fWXk,)). 

Again taking limits we obtain /(x*) < f{4>{x^)). Therefore /(x*) — /(-(/^(x*)). But 
by assumption, this equality implies that x* is a fixed point of ip, and thus (a) is 
proven. 
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We now turn to the proof of (b), which follows the proof of Proposition 10.3.3 
in [27]. Recall {xj.} denotes the iterate sequence. Since /(x^.) is decreasing and / is 
bounded below on £/(xo), we can assert that /(x^) is a convergent sequence with a 
limit /*. Assume the contrary of (b), i.e., that there exists an e > and a subsequence 
{k(} of the indices such that 

||xfcj+i - Xfc^ II > e for all kg. (D.l) 

Note that this subsequence is different from the one discussed in proving part (a). 
Since x^^ e £/(xo), by possibly restricting {ke} to a further subsequence, we may 
assume that x^-^, converges to a limit u. By possibly restricting {ki} to yet a further 
subsequence, we may additionally assume that Xfe^-|_i converges to a limit v. By (D.l), 
we can conclude ||v — u|| > e. Note that Xfe^+i = ^{xkf). Taking the limit of both 
sides and using the continuity of we obtain ip{u) = v. Additionally, using the 
continuity of /, 

/(u) = lim /(Xfej = /* = lim /(xfc,+i) = /(v). 

Since v = we have that /(u) = f{ip{u)) which by assumption occurs if and 

only if u = 'ijj{u). This implies that u = v, and we have arrived at a contradiction. □ 
We now prove a series of lemmas leading up to a proof of the desired convergence 
result. 

Lemma D.2. Let B > such that /(B) is finite and suppose B 7^ B * Then 
/(B)>/(B**). 

Proof. By Lemma C.2 (B * $)^ is the unique global minimizcr of g(-, B^) which 
majorizes / at B^. Therefore, if B 7^ B * we must have /(B) = g(B^,B^) > 
.g((B*$)T,B"^) > /(B**). □ 

Lemma D.3. Let f be as defined in (4.1). For any nonnegative matrix Bq such 
that /(Bo) is finite, the level set £/(Bo) = { B > | /(B) < /(Bq) } is compact. 

Proof. The proof follows the same logic as the proof for Lemma B.l. □ 

Lemma D.4. Let f be as defined in (4.1) and ip be as defined in (4.3), and suppose 
Assumption 4.2 is satisfied. For any nonnegative matrix Bfe such that /(Bq) is finite, 
the sequence B^+i = V'(Bfc) converges. 

Proof. Note that all limit points of ip are fixed points of / by Theorem D.l. 

First, we show that the set of fixed point is finite. Suppose that B is a fixed point 
of ip. Then we must have B * (E — ^'(B)) = 0. By Lemma 3.4, it can be verified that 
B is the unique global minimizer of 

min /(U) s.t. U e { U > I u„ = if 6„ = } , 

where / is as defined in (4.1). Therefore, any fixed point that has the same zero 
pattern of B must be equal to B. Since there are only a finite number of possible zero 
patterns, the number of fixed points is finite. 

Since every limit point is a fixed point by Theorem D.l(a), there are only finitely 
many limit points. Let {Afp} denote a collection of arbitrarily small neighborhoods 
around each fixed point indexed by p. Only finitely many iterates B^ arc in >Cj(Bo) — 
UpAfp. So, all but finitely many iterates B^ will be in UpMp. But ||Bi:+i — Bfc|| 
eventually becomes smaller than smallest distance between any two neighborhoods by 
Theorem D.l(b). Therefore the sequence B^, must belong to one of the neighborhoods 
for all but finitely many k. So, any sequence of iterates must eventually converge to 
exactly one of the fixed points oi tp. D 
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We now argue that it is impossible for the MM iterate sequence to converge to a 
non-KKT point if it has been appropriately initialized. 

Lemma D.5. Let f be as defined in (4.1) and suppose Assumption 4.2 is satisfied. 
Suppose Bj; — >■ is a convergent sequence of iterates defined by (4.3) with Bg > and 
/(Bo) finite. //(Bo)„ > for all (i,r) such that (*(B*))„ > 1, then V/(B,) > 0. 

Proof. We give a proof by contradiction. Suppose there exists (i,r) such that 
(Bo)j^ > but (V/(B*))ir < 0. Since B* is a fixed point of t/i, we must have [1 — 
(€>(B*))ir](B*)ir = 0. By our assumption, however (V/(B*))ir = [l-(*(B*)),jr] < 0. 
Thus, we must have (B*)^^ = 0. On the other hand, (Bfc)^^ > for all k (proof left to 
reader). Since $(•) is a continuous function of B on £/(Bo), we know that there exists 
some K such that k > K implies B^ is close enough to B* such that (V/(Bfe))ir — [1 — 
(*(Bfc))jr] < 0. Since (Bfe)„. > 0, we have [1 - (*(Bfe))„](Bfe)„ < 0, which implies 
(Bfe)ir < (Bfc+i),> for all k > K. But this contradicts liink^oo{'Bk)ir — (B*)^^ = 0. 
Hence, the claim. □ 

We now prove Theorem 4.3. 

Proof, [of Theorem 4.3] By Lemma D.4, the sequence {B^} converges; we call 
the limit point B,^. At this limit point, we have: (a) B* > 0, (b) V/(B») > by 
Lemma D.5, (c) and B* * V/(B*) = by virtue of B* being a fixed point of V'- Thus, 
the point B* satisfies the KKT conditions with respect to (4.1). Furthermore, since 
/ is strictly convex by Lemma 3.4, we can conclude that B* is the global minimum 
of/. □ 

Appendix E. Numerical experiment details for §6.1. All implementations 
are from Version 2.5 of Tensor Toolbox for MATLAB [4]. All methods use a common 
initial guess for the solution. 

• Lee-Seung LS: Implemented in cpjimu as descibed in [3]. We use the default 
parameters except that the maximum number of iterations (maxiters) is set 
to 200 and the tolerance on the change in the fit (tol) is set to 10^'*. 

• CP-ALS: Luplemented in cp_als as described in [3]. We use the default 
parameter settings except that the maximum number iterations (maxiters) 
is 200 and the tolerance on the changes in fit (tol) is 10~*. 

• Lee Seung KL: Implemented in cp_apr as described in this paper. The pa- 
rameters are set as follows: fcmax — 200 (maxiters), fmax = 1 (maxinner iters), 
r = 10~^ (tol), K = (kappa), e (epsilon). 

• CP- APR: Implemented in cp_apr as described in this paper. The parame- 
ters are set as follows: fcmax = 200 (maxiters), £,nax = 10 (maxinner iters), 
r = 10""* (tol), K ^ 10"^ (kappa), Kto\ = 10"^° (kappatol), e = (epsilon). 

We compare the methods in terms of their "factor match score," defined as follows. 
Let M = lA; A(i\ . . . , A^^^] be the true model and let M = JA; a'^\ . . . A*^^] be 
the computed solution. The score of M is computed as 



The FMS is a rather abstract measure, so we also give results for the number of 
columns in A^^-* that are correctly identified. In other words, we count the number 
of times that the cosine of the angle between the true solution and the computed 




where = J| ||a(") || and = [| I 



n 
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solution is greater than 0.95, mathematically, r II > 0.95. We use 

the first mode, but the results are representative of the other modes. 

Appendix F. Additional Enron results. 

Figure F.l illustrates the four components omitted in Figure 6.2. 
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Fig. F.l: Remaining components from factorizing the Enron data 
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